Boston is a metropolitan city and home to more than 11,000 food establishments Leung, 2018. Bostonians have plethora of dining choices that are available to them. This is an amazing perks for Bostonians who love to eat out and try out different food scenes; however, this also means that customers are faced with many different options. Choosing one will not be easy. Food and restaurant safety–measured by proxy of passing health inspections–is one of the most important factors that customers should think about when dining out. However, information on health inspection violations might not be readily accessible to customers. For example, Boston CBS wrote that “almost every Boston restaurant gets an”A" grade, even those with flarant health code records“, and that these grades do not show the full history of restaurants’ violations. Restaurants that severely violated the inspection codes could regain their”A" grade after passing their next inspections; and some restaurants had shown patterns of repeatedly cleaning up their acts after failing inspections (Boston CBS). Knowing the recent statistics that there are more than 1,600 health inspection violations at Boston restaurants in 2017 Leung, 2018, this piece of information is valuable for customers. To tackle this, we are interested in looking at food establishment inspections data provided by the City of Boston to analyze the trends in health inspections violations amongs restaurants inspected in Greater Boston. Specifically, we are interested in looking at the patterns of violations across the years, food establishment types, and geographical area. #Initial Questions: What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
=========================================================================================================== #Data Cleaning Data downloaded at 11/8/2018 at 4:44pm. Data saved as csv file. There are 547700 observations with 26 variables.
Current info: - businessName (0 missing) - DBAName (suggest drop, reason: missing values) - LegalOwner (suggest drop, reason: missing values) - NameLast - NameFirst (suggest drop, reason: some are informative–e.g. variations on “inc”) - LICENSENO - ISSDTTM - EXPDTTM - LICSTATUS (“Active”, “Deleted”, “Inactive”) –> combine “Deleted” and “Inactive”? - LICENSECAT (“FS” “FT” “MFW” “RF”) - DESCRIPT (“Eating & Drinking” “Eating & Drinking w/ Take Out” “Mobile Food Walk On” “Retail Food”) - RESULT [1] “DATAERR” “Fail” “Failed” “HE_Closure” “HE_Fail” “HE_FailExt” [7] “HE_FAILNOR” “HE_Filed” “HE_Hearing” “HE_Hold” “HE_Misc” “HE_NotReq” [13] “HE_OutBus” “HE_Pass” “HE_TSOP” “Pass” “PassViol”
RESULT n
After looking at the data, I would suggest trimming them.. 35,751 have different years between RESULTDTTM vs VIOLDTTM. I would suggest going with VIOLDTTM and ViolDesc, since we (or I) have no clue what RESULT means. 31,240 observation (5.7%) is missing the violation years. Options: 1) impute from the RESULTSDTTM, 2) complete case analysis. For this case, we stick with complete case, excluding any observations without date (thus years) of violation, description of violation (1 obs), and violation status (5744 observations). Total observations are 510,715.
keeps <- c("businessName","NameLast", "LICENSENO","LICSTATUS", "DESCRIPT","ViolDesc","VIOLDTTM", "ViolStatus", "Address","CITY", "ZIP", "ViolLevel")
library(lubridate)
violation<-test[keeps]
violation <- violation %>% mutate(year = year(VIOLDTTM))
ccvioly <- violation[!is.na(violation$year), ]
ccvioly <- ccvioly[!is.na(ccvioly$ViolDesc),]
ccvioly <- ccvioly[!is.na(ccvioly$ViolStatus),]
ccvioly <- ccvioly[!is.na(ccvioly$ZIP),]
ccvioly<-ccvioly[which(ccvioly$ViolDesc != ""),]
ccvioly<-ccvioly[which(ccvioly$ZIP!= "0"),]
summary(ccvioly)
## businessName NameLast
## Subway : 2243 CVS PHARMACY INC. : 1952
## Dunkin Donuts : 1942 c/o Lyons Group : 1794
## Mcdonalds : 1641 STARBUCKS CORPORATION : 1741
## Dunkin' Donuts: 1524 Shaw's Supermarket Inc. : 1519
## Au Bon Pain : 1485 WHOLE FOODS MARKET GROUP INC.: 1479
## McDonald's : 1373 Stop & Shop Supermarket Co. : 1375
## (Other) :499807 (Other) :500155
## LICENSENO LICSTATUS DESCRIPT
## Min. : 4032 Active :338921 Eating & Drinking :247160
## 1st Qu.: 21261 Deleted : 27 Eating & Drinking w/ Take Out:197305
## Median : 24711 Inactive:171067 Mobile Food Walk On : 4095
## Mean : 46539 Retail Food : 61455
## 3rd Qu.: 37246
## Max. :379898
##
## ViolDesc
## Non-Food Contact Surfaces Clean : 40895
## Improper Maintenance of Walls/Ceilings: 37396
## Non-Food Contact Surfaces : 33021
## Improper Maintenance of Floors : 31162
## Food Protection : 28422
## Food Contact Surfaces Clean : 24762
## (Other) :314357
## VIOLDTTM ViolStatus
## Min. :2006-12-13 13:31:00 Fail:283352
## 1st Qu.:2010-04-16 12:24:44 Pass:226663
## Median :2013-06-05 10:57:31
## Mean :2013-04-24 18:18:37
## 3rd Qu.:2016-03-15 13:05:36
## Max. :2018-11-08 14:15:05
##
## Address CITY
## 1 Citywide ST : 3766 Boston :189836
## 417 Washington ST : 2398 Dorchester : 51222
## 800 BOYLSTON ST : 2395 BOSTON : 48403
## 200 Faneuil Hall Market PL : 2326 Roxbury : 37799
## 100-199 Faneuil Hall Market PL: 2282 East Boston: 26316
## 300 FANEUIL HALL MARKET PL : 2057 (Other) :156419
## (Other) :494791 NA's : 20
## ZIP ViolLevel year
## Min. :2050 * :386457 Min. :2006
## 1st Qu.:2115 ** : 32442 1st Qu.:2010
## Median :2121 *** : 91116 Median :2013
## Mean :2127 1919: 0 Mean :2013
## 3rd Qu.:2129 3rd Qu.:2016
## Max. :2467 Max. :2018
##
We could try to categorize these violations to violation categories: (http://www.foodprotection.org/files/food-protection-trends/MarApr-14-Burke-Manes.pdf) 1) Floors, Walls, and Ceilings “Fixture’s not properly shielded” “Floors Designed Constructed Installed” “Improper Maintenance of Floors” “Improper Maintenance of Walls/Ceilings” “Walls/Ceilings Designed Constructed Installed”
Food Equipment and Utensils “Clean Equipment & Utensils Storage” “Dishwashng Facilities” “Equipment Thermometers” “Food Contact Surfaces Clean” “Food Contact Surfaces Design” “Food Thermometers Provided” “Food Utensil Storage” “Improper Storage of Re-usable Utensils” “Inadequate Facilities/Cooling Methods” “Installed and Maintained” “Non-Food Contact Surfaces” “Non-Food Contact Surfaces Clean” “Pre-Flushed Scrapped Soaked” “Re-use of Single Service Articles” “Recieving/Sound Condition/Proper Temperatures” “Single Service Articles Stored Dispensed” “Test Kit Provided” “Three Compartment Sink” “Wash Rinse Water Clean Proper Temperature.” “Wiping Cloths Clean Sanitize”
Food Protection “Cold Holding” “Hot Holding” “Conformance w/ Approved Procedures” “Cooking Temperatures” “Cooling”
“Food products protected against contamination(Properly covered/secured for overnight storage).” “Food Protection” “Food Restrictions & Preparation” “Handling of Food & Ice” “PHF’s Properly Thawed” “Reduced Oxygen Packaging” “Reheating” “Reservice of PHF or Unwrapped Foods” “Separation Segregation Cross Contamination” “Washing fruits and veg’s.”
Toilet and hand-washing facilities “Adequate Handwashing/Where/When/How” “Hand Cleaner Drying Tissue Signage” “Location Accessible”
“Low Sanitizer Alarm” “Number Convenient” “Prevention of Contamination from Hands” “Separation/Sanitizer Criteria” “Toilet Enclosed Clean”
Food “Approved Food or Color Additives” “Approved Source” “Food Container Labels” “Labeled Common Name” “Labeling of Ingredients” “Shellstock ID” “Spoilage Unsafe Food” “Tags & Records”
Plumbing “Cross Connection Back Siphonage Backflow” “Sewage and Waste Water”
Lighting “Inadequate Lighting”
Insect, Rodent, Animal Control “Adequate Number Frequency Vermin Proof” “Insects Rodents Animals” “Parasite Destruction” “Pesticide Usage”
Garbage and Refuse Disposal “Improper Cleaning of Receptacles” “Outside Storage Improperly Maintained”
Personnel “Clean Cloths Hair Restraint” “Good Hygienic Practices” “Non-Compliance w/Employee Health Policy” “Person in charge Assigned” “Personnel w/ Infections Restricted/Excluded” “PIC Knowledge” PIC Performing Duties" “Safe Food Handeling Instructions” “Times as a Public Health Control” “Tobacco”
Water “Hot and Cold Water” “Unsafe Water”
Dressing Rooms “Dressing Rooms Clean/Lockers Provided”
Other “Anti-Choking” “Consumer Advisories”
“Living/Sleeping Quaters/Laundry” “Medicines FirstAid Storage” “Mop Sink not Provided” “Premises Maintained” “Proper storage of clean linen” “Public Nuisance” “Rooms and Equipment Vented” “Soiled Linen Storage” “Toxic Items: Original Container”
levels(ccvioly$ViolDesc)
## [1] "Adequate Handwashing/Where/When/How"
## [2] "Adequate Number Frequency Vermin Proof"
## [3] "Anti-Choking"
## [4] "Approved Food or Color Additives"
## [5] "Approved Source"
## [6] "Clean Cloths Hair Restraint"
## [7] "Clean Equipment & Utensils Storage"
## [8] "Cold Holding"
## [9] "Conformance w/ Approved Procedures"
## [10] "Consumer Advisories"
## [11] "Cooking Temperatures"
## [12] "Cooling"
## [13] "Cross Connection Back Siphonage Backflow"
## [14] "Dishwashng Facilities"
## [15] "Dressing Rooms Clean/Lockers Provided"
## [16] "Equipment Thermometers"
## [17] "Fixture's not properly shielded"
## [18] "Floors Designed Constructed Installed"
## [19] "Food Contact Surfaces Clean"
## [20] "Food Contact Surfaces Design"
## [21] "Food Container Labels"
## [22] "Food products protected against contamination(Properly covered/secured for overnight storage)."
## [23] "Food Protection"
## [24] "Food Restrictions & Preparation"
## [25] "Food Thermometers Provided"
## [26] "Food Utensil Storage"
## [27] "Good Hygienic Practices"
## [28] "Hand Cleaner Drying Tissue Signage"
## [29] "Handling of Food & Ice"
## [30] "Hot and Cold Water"
## [31] "Hot Holding"
## [32] "Improper Cleaning of Receptacles"
## [33] "Improper Maintenance of Floors"
## [34] "Improper Maintenance of Walls/Ceilings"
## [35] "Improper Storage of Re-usable Utensils"
## [36] "Inadequate Facilities/Cooling Methods"
## [37] "Inadequate Lighting"
## [38] "Insects Rodents Animals"
## [39] "Installed and Maintained"
## [40] "Labeled Common Name"
## [41] "Labeling of Ingredients"
## [42] "Living/Sleeping Quaters/Laundry"
## [43] "Location Accessible"
## [44] "Low Sanitizer Alarm"
## [45] "Medicines FirstAid Storage"
## [46] "Mop Sink not Provided"
## [47] "Non-Compliance w/Employee Health Policy"
## [48] "Non-Food Contact Surfaces"
## [49] "Non-Food Contact Surfaces Clean"
## [50] "Number Convenient"
## [51] "Outside Storage Improperly Maintained"
## [52] "Parasite Destruction"
## [53] "Person in charge Assigned"
## [54] "Personnel w/ Infections Restricted/Excluded"
## [55] "Pesticide Usage"
## [56] "PHF's Properly Thawed"
## [57] "PIC Knowledge"
## [58] "PIC Performing Duties"
## [59] "Pre-Flushed Scrapped Soaked"
## [60] "Premises Maintained"
## [61] "Prevention of Contamination from Hands"
## [62] "Proper storage of clean linen"
## [63] "Public Nuisance"
## [64] "Re-use of Single Service Articles"
## [65] "Recieving/Sound Condition/Proper Temperatures"
## [66] "Reduced Oxygen Packaging"
## [67] "Reheating"
## [68] "Reservice of PHF or Unwrapped Foods"
## [69] "Rooms and Equipment Vented"
## [70] "Safe Food Handeling Instructions"
## [71] "Separation Segregation Cross Contamination"
## [72] "Separation/Sanitizer Criteria"
## [73] "Sewage and Waste Water"
## [74] "Shellstock ID"
## [75] "Single Service Articles Stored Dispensed"
## [76] "Soiled Linen Storage"
## [77] "Spoilage Unsafe Food"
## [78] "Tags & Records"
## [79] "Test Kit Provided"
## [80] "Three Compartment Sink"
## [81] "Times as a Public Health Control"
## [82] "Tobacco"
## [83] "Toilet Enclosed Clean"
## [84] "Toxic Items: Original Container"
## [85] "Unsafe Water"
## [86] "Walls/Ceilings Designed Constructed Installed"
## [87] "Wash Rinse Water Clean Proper Temperature."
## [88] "Washing fruits and veg's."
## [89] "Wiping Cloths Clean Sanitize"
ccvioly %>% group_by(ViolDesc) %>% summarise(n=length(businessName))
## # A tibble: 88 x 2
## ViolDesc n
## <fct> <int>
## 1 Adequate Handwashing/Where/When/How 3289
## 2 Adequate Number Frequency Vermin Proof 5275
## 3 Anti-Choking 1092
## 4 Approved Food or Color Additives 64
## 5 Approved Source 1166
## 6 Clean Cloths Hair Restraint 4844
## 7 Clean Equipment & Utensils Storage 4545
## 8 Cold Holding 12781
## 9 Conformance w/ Approved Procedures 1393
## 10 Consumer Advisories 3526
## # ... with 78 more rows
ccvioly %>% group_by(ViolStatus) %>% summarise(n=length(businessName))
## # A tibble: 2 x 2
## ViolStatus n
## <fct> <int>
## 1 Fail 283352
## 2 Pass 226663
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ readr 1.1.1
## ✔ tibble 1.4.2 ✔ purrr 0.2.5
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ ggplot2 3.1.0 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ lubridate::intersect() masks base::intersect()
## ✖ dplyr::lag() masks stats::lag()
## ✖ lubridate::setdiff() masks base::setdiff()
## ✖ lubridate::union() masks base::union()
ccvioly$violcats <- fct_collapse(ccvioly$ViolDesc,
"Floors, Walls, and Ceilings" = c("Fixture's not properly shielded","Floors Designed Constructed Installed","Improper Maintenance of Floors","Improper Maintenance of Walls/Ceilings","Walls/Ceilings Designed Constructed Installed"),
"Food Equipment and Utensils" = c("Clean Equipment & Utensils Storage",
"Dishwashng Facilities", "Equipment Thermometers","Food Contact Surfaces Clean","Food Contact Surfaces Design", "Food Thermometers Provided", "Food Utensil Storage" ,"Improper Storage of Re-usable Utensils", "Inadequate Facilities/Cooling Methods" ,"Installed and Maintained", "Non-Food Contact Surfaces" , "Non-Food Contact Surfaces Clean" ,"Pre-Flushed Scrapped Soaked", "Re-use of Single Service Articles","Recieving/Sound Condition/Proper Temperatures", "Single Service Articles Stored Dispensed", "Test Kit Provided"
,"Three Compartment Sink", "Wash Rinse Water Clean Proper Temperature.", "Wiping Cloths Clean Sanitize"),
"Food Protection" = c("Cold Holding" ,"Hot Holding" ,"Conformance w/ Approved Procedures","Cooking Temperatures","Cooling","Food products protected against contamination(Properly covered/secured for overnight storage).","Food Protection" ,"Food Restrictions & Preparation","Handling of Food & Ice","PHF's Properly Thawed","Reduced Oxygen Packaging" ,"Reheating","Reservice of PHF or Unwrapped Foods","Separation Segregation Cross Contamination","Washing fruits and veg's."),
"Toilet and hand-washing facilities" = c("Adequate Handwashing/Where/When/How","Hand Cleaner Drying Tissue Signage","Location Accessible", "Low Sanitizer Alarm", "Number Convenient" , "Prevention of Contamination from Hands", "Separation/Sanitizer Criteria", "Toilet Enclosed Clean"),
"Food" = c("Approved Food or Color Additives",
"Approved Source" ,"Food Container Labels" ,"Labeled Common Name","Labeling of Ingredients" ,"Shellstock ID","Spoilage Unsafe Food","Tags & Records"),
"Plumbing"=c("Cross Connection Back Siphonage Backflow",
"Sewage and Waste Water"),
"Lighting"=c("Inadequate Lighting"),
"Insect, Rodent, Animal Control"=c("Adequate Number Frequency Vermin Proof","Insects Rodents Animals","Parasite Destruction", "Pesticide Usage"),
"Garbage and Refuse Disposal" = c("Improper Cleaning of Receptacles","Outside Storage Improperly Maintained"),
"Personnel"= c("Clean Cloths Hair Restraint","Good Hygienic Practices","Non-Compliance w/Employee Health Policy","Person in charge Assigned" ,"Personnel w/ Infections Restricted/Excluded","PIC Knowledge","PIC Performing Duties", "Safe Food Handeling Instructions","Times as a Public Health Control", "Tobacco"),
"Water" = c("Hot and Cold Water","Unsafe Water"),
"Dressing Rooms" =c("Dressing Rooms Clean/Lockers Provided"),
"Others"= c("Anti-Choking", "Consumer Advisories", "Living/Sleeping Quaters/Laundry","Medicines FirstAid Storage","Mop Sink not Provided","Premises Maintained","Proper storage of clean linen","Public Nuisance" ,"Rooms and Equipment Vented","Soiled Linen Storage","Toxic Items: Original Container")
)
levels(ccvioly$violcats)
## [1] "Toilet and hand-washing facilities"
## [2] "Insect, Rodent, Animal Control"
## [3] "Others"
## [4] "Food"
## [5] "Personnel"
## [6] "Food Equipment and Utensils"
## [7] "Food Protection"
## [8] "Plumbing"
## [9] "Dressing Rooms"
## [10] "Floors, Walls, and Ceilings"
## [11] "Water"
## [12] "Garbage and Refuse Disposal"
## [13] "Lighting"
sumx <-ccvioly %>% group_by(LICENSENO,year,violcats, ViolStatus) %>% summarise(n=n(), name=paste(unique(businessName)), LICSTATUS=paste(unique(LICSTATUS)), DESCRIPT=paste(unique(DESCRIPT)), ZIP=paste(unique(ZIP)))
sumviol <- ccvioly %>% filter(., ViolStatus == "Fail") %>% group_by(LICENSENO,year,violcats) %>%
summarise(n=n(), name=paste(unique(businessName)), LICSTATUS=paste(unique(LICSTATUS)), DESCRIPT=paste(unique(DESCRIPT)),
YEAR=paste(unique(year)), violcat=paste(unique(violcats)), ZIP=paste(unique(ZIP)))
sumpass <- ccvioly %>% filter(., ViolStatus == "Pass") %>% group_by(LICENSENO,year,violcats) %>%
summarise(n=n(), name=paste(unique(businessName)), LICSTATUS=paste(unique(LICSTATUS)), DESCRIPT=paste(unique(DESCRIPT)),
YEAR=paste(unique(year)), violcat=paste(unique(violcats)), ZIP=paste(unique(ZIP)))
sumx is dataset that groups the data by license number, year of violation, violation categories, and violation status; collapse them based on these, and count the # of rows based on these. Thus it represents for a particular food establishment at year X, what would be the # of failing/passing violations for a particular violation types.
finalviol <- sumx[,c(1,6,8,9,2,7,3,4,5)]
head(finalviol)
## # A tibble: 6 x 9
## # Groups: LICENSENO, year, violcats [6]
## LICENSENO name DESCRIPT ZIP year LICSTATUS violcats ViolStatus n
## <int> <chr> <chr> <chr> <dbl> <chr> <fct> <fct> <int>
## 1 4032 Stop… Retail … 2131 2007 Inactive Toilet … Fail 2
## 2 4032 Stop… Retail … 2131 2007 Inactive Insect,… Fail 2
## 3 4032 Stop… Retail … 2131 2007 Inactive Food Eq… Fail 10
## 4 4032 Stop… Retail … 2131 2007 Inactive Food Pr… Fail 2
## 5 4032 Stop… Retail … 2131 2007 Inactive Floors,… Fail 4
## 6 4032 Stop… Retail … 2131 2007 Inactive Garbage… Fail 4
write.csv(finalviol, 'finalviol.csv')
We then uploaded this finalviol.csv to github
=========================================================================================================== #Restaurant Violations Analysis based on Years of Inspections The goal of this analysis is to look at the trends of restaurant violations based on the years of inspections performed.
library(tidyverse)
library(dplyr)
library(ggplot2)
Starting by exploring the contents of the datset.
summary(finalviol)
## X LICENSENO name
## Min. : 1 Min. : 4032 Subway : 1173
## 1st Qu.: 55772 1st Qu.: 21420 Dunkin Donuts : 1152
## Median :111544 Median : 25121 Mcdonalds : 822
## Mean :111544 Mean : 48590 Dunkin' Donuts: 816
## 3rd Qu.:167316 3rd Qu.: 67870 Au Bon Pain : 676
## Max. :223087 Max. :379898 Burger King : 649
## (Other) :217799
## DESCRIPT ZIP year
## Eating & Drinking :102258 Min. :2050 Min. :2006
## Eating & Drinking w/ Take Out: 86013 1st Qu.:2115 1st Qu.:2010
## Mobile Food Walk On : 2261 Median :2122 Median :2013
## Retail Food : 32555 Mean :2128 Mean :2013
## 3rd Qu.:2130 3rd Qu.:2016
## Max. :2467 Max. :2018
##
## LICSTATUS violcats
## Active :148255 Food Equipment and Utensils :49731
## Deleted : 12 Floors, Walls, and Ceilings :38450
## Inactive: 74820 Food Protection :30427
## Toilet and hand-washing facilities:25969
## Others :20704
## Personnel :17433
## (Other) :40373
## ViolStatus n
## Fail:117866 Min. : 1.000
## Pass:105221 1st Qu.: 1.000
## Median : 1.000
## Mean : 2.286
## 3rd Qu.: 3.000
## Max. :64.000
##
table(finalviol$year, finalviol$ViolStatus)
##
## Fail Pass
## 2006 0 2
## 2007 5498 6035
## 2008 10865 9459
## 2009 9928 8588
## 2010 8879 7657
## 2011 8669 7462
## 2012 10730 9585
## 2013 9781 8589
## 2014 10467 9310
## 2015 10426 9349
## 2016 11519 10489
## 2017 10177 8975
## 2018 10927 9721
Since there are only two observations for 2006, I’ve filtered out these observations from the subsequent analyses.
Looking at the trend in violation status of food inspections from 2007 to 2018, we see that there is a large jump in the number of food inspections that were conducted from 2007 to 2008. The number of food inspections declines between 2009 and 2011 and then picks up and remains fairly constant starting in 2012.
finalviol %>% filter (year %in% c(2007,2008, 2009, 2010,2011, 2012, 2013, 2014, 2015, 2016, 2017,2018))%>% group_by(year) %>%
ggplot(aes(ViolStatus, fill=ViolStatus))+ geom_bar() +
xlab ("Violation Status")+
ggtitle("Violation Status of Food Inspections from 2007-2018")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))+
facet_grid(.~year)
Throughout the years, around 53% of the food inspections have resulted in a Failed inspection.
finalviol %>% filter (year !=2006) %>%
group_by(year) %>%
summarize(number_fail = sum(ViolStatus=="Fail"),number_pass =sum(ViolStatus=="Pass"),
total = n(), percent_fail=number_fail/total)
## # A tibble: 12 x 5
## year number_fail number_pass total percent_fail
## <int> <int> <int> <int> <dbl>
## 1 2007 5498 6035 11533 0.477
## 2 2008 10865 9459 20324 0.535
## 3 2009 9928 8588 18516 0.536
## 4 2010 8879 7657 16536 0.537
## 5 2011 8669 7462 16131 0.537
## 6 2012 10730 9585 20315 0.528
## 7 2013 9781 8589 18370 0.532
## 8 2014 10467 9310 19777 0.529
## 9 2015 10426 9349 19775 0.527
## 10 2016 11519 10489 22008 0.523
## 11 2017 10177 8975 19152 0.531
## 12 2018 10927 9721 20648 0.529
finalviol_year <- finalviol %>% filter (year !=2006) %>%
group_by(year) %>%
summarize(number_fail = sum(ViolStatus=="Fail"),number_pass =sum(ViolStatus=="Pass"),
total = n(), percent_fail=(number_fail/total*100))
finalviol_year %>% summarize(meanfail = mean(percent_fail))
## # A tibble: 1 x 1
## meanfail
## <dbl>
## 1 52.7
finalviol_year %>% ggplot() +
geom_line(aes (x=year, y=percent_fail), size = 1, color = "red")+
ylim(45,55)+
geom_point(aes (x=year, y=percent_fail), size = 2, color = "black")+
xlab("Year")+
ylab("% of Failed Inspection Test")+
ggtitle("Percentage of Failed Inspection Test from 2007-2018")
Next,looking at the distribution of number of violations of those that Failed the Inspection Test, we see that across the years, the category with the largest count of violations is for number of violations <= 5.
finalviol %>% filter (ViolStatus=="Fail", year %in% c(2007,2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,2018))%>%
ggplot(aes(n)) +
geom_histogram (binwidth = 3, color = "black")+
xlim(0, 20)+
facet_wrap(.~year, ncol=6)+ xlab("Number of Violations")+
ggtitle("Distribution of no. violations of establishments that Failed Inspection Test")
## Warning: Removed 265 rows containing non-finite values (stat_bin).
## Warning: Removed 24 rows containing missing values (geom_bar).
However, we see from the boxplot below that there is a wide spread in the number of violations, with some inspections detecting over 40 counts of violations!
finalviol %>% filter (year >2007 & ViolStatus=="Fail" ) %>%
ggplot(aes(x=year, y = n, group=year))+
geom_boxplot()+
ylab("Number of Violations")+
ggtitle("Boxplot of No. of Violations of Establishments that Failed Inspection Test")
Adding the total number of violations for each ViolStatus category, grouped by year:
finalviol1 <- finalviol %>% filter (year !=2006 & ViolStatus=="Fail") %>%
group_by(year) %>%
summarize(total_no_violations_fail = sum(n))
finalviol2 <-finalviol %>% filter (year !=2006 & ViolStatus=="Pass") %>%
group_by(year) %>%
summarize(total_no_violations_pass = sum(n))
finalviol_number<- left_join(finalviol1,finalviol2)
## Joining, by = "year"
finalviol_number<- finalviol_number %>%
mutate(total_no_violations_diff = abs((total_no_violations_fail) - (total_no_violations_pass)))
In this graph below, the red line represents the trend in the total number of violations for those that failed Inspection. Similarly, the blue line represents the trend for those that Passed the Inspection. Trend lines for the two groups are very similar, with the Failed Inspections reporting between 4,000-6,000 more number of violations.
finalviol_number %>%
ggplot()+
geom_line(aes (x=year, y = total_no_violations_fail), size =0.5, color = "red")+
geom_point(aes (x=year, y= total_no_violations_fail), size = 2, color = "black")+
geom_line(mapping=aes(x=year, y=total_no_violations_pass), size=0.5, color="blue")+
geom_point(aes (x=year, y= total_no_violations_pass), size = 2, color = "black")+
xlab("Year") +
ylab(" Total No. of Violations ") +
ggtitle("Total Number of Violations from 2007-2018") +
geom_text(aes(x= 2013, y = 15000, label="Pass"), color="blue")+
geom_text(aes(x= 2010, y = 25000, label="Fail"), color="red")
This graph below shows the difference in the number of violations based on violation status.
finalviol_number %>%
ggplot()+
geom_line(aes (x=year, y = total_no_violations_diff), size =0.5, color = "purple")+
geom_point(aes (x=year, y= total_no_violations_diff), size = 2, color = "black")+
xlab("Year") +
ylab(" Difference in No. of Violations ") +
ggtitle("Difference in Number of Passed vs Failed Inspection Annually")
============================================================================================================= #Analysis based on restaurant type
The goal of this analysis is to explore differences in restaurants inspections violations based on the restaurant and restaurant type.
library(tidyverse)
library(gridExtra)
finalviol <- read.csv("https://raw.githubusercontent.com/plimasalle/Food_Project/master/finalviol.csv")
finalviol <- finalviol %>%
filter(year > 2006, LICSTATUS == "Active") %>%
mutate(violstatus_fail = ifelse(ViolStatus == "Fail", 1, 0))
finalviol_anyfail <- finalviol %>%
group_by(DESCRIPT, name, ZIP, LICENSENO, year) %>%
summarize(fails = sum(violstatus_fail), num_cat = n()) %>%
mutate(anyfails = ifelse(fails > 0, 1, 0), fivefails = ifelse(fails > 4, 1, 0), pcnt_fails = fails/num_cat)
finalviol_anyfail %>% group_by(DESCRIPT) %>%
summarize(mean(anyfails), n())
## # A tibble: 4 x 3
## DESCRIPT `mean(anyfails)` `n()`
## <fct> <dbl> <int>
## 1 Eating & Drinking 0.988 9988
## 2 Eating & Drinking w/ Take Out 0.984 7959
## 3 Mobile Food Walk On 0.997 308
## 4 Retail Food 0.981 3696
There are four main categories of restaurants in the dataset: 1. Eating & Drinking 2. Eating & Drinking w/ Take Out 3. Mobile Food Walk On 4. Retail Food
Across all years, the vast majority of restaurants had at least one failure across the violation categories inspected (ranging from 98.1% among Retail Food to 99.7% among Mobile Food Walk On restaurants).
Note: These analyses are limited to restaurants with active licenses only, but these numbers did not change substantially (fractions of percentage points) when restaurants with inactive licenses were also included.
finalviol_anyfail %>% group_by(DESCRIPT) %>%
summarize(mean(fivefails), n())
## # A tibble: 4 x 3
## DESCRIPT `mean(fivefails)` `n()`
## <fct> <dbl> <int>
## 1 Eating & Drinking 0.338 9988
## 2 Eating & Drinking w/ Take Out 0.313 7959
## 3 Mobile Food Walk On 0.166 308
## 4 Retail Food 0.227 3696
Across all years, there was more variation across restaurant categories in the percent of restaurants per category that had at least five failures across violation categories. About a third of Eating and Drinking and Eating & Drinking w/ Take Out restaurants, about a fifth of Retail Food restaurants, and about a sixth of Mobile Food Walk On restuarants had at least five failures.
finalviol_anyfail %>% filter(anyfails > 0) %>%
group_by(DESCRIPT) %>%
summarize(mean(pcnt_fails), n())
## # A tibble: 4 x 3
## DESCRIPT `mean(pcnt_fails)` `n()`
## <fct> <dbl> <int>
## 1 Eating & Drinking 0.590 9870
## 2 Eating & Drinking w/ Take Out 0.601 7832
## 3 Mobile Food Walk On 0.597 307
## 4 Retail Food 0.680 3627
Among restaurants that failed at least one inspection category, Retail Food restaurants overall failed the highest percent of violation categories inspected with the average Retail Food restaurant failing 67.9% of the violation categories inspected. Eating & Drinking restaurants failed the lowest percent of categories inspected with the average Eating & Drinking restaurant failing 58.9% of violation categories inspected.
finalviol_typeyear <- finalviol_anyfail %>%
filter(anyfails > 0) %>%
group_by(DESCRIPT, year) %>%
summarize(avg_pcnt_fails = mean(pcnt_fails), count = n())
ggplot(finalviol_typeyear, aes(year, avg_pcnt_fails, color = DESCRIPT)) +
geom_line() +
scale_x_continuous(name = "Year") +
scale_y_continuous(name = "Percent of Violation Categories Failed", limits = c(0, 0.8))
With the exception of Mobile Food Walk On, all restuarants that failed at least one violation category inspected tended to fail a similar percent of violation categories inspected across years. This variation for Mobile Food Walk On may be driven by the low number of Mobile Food Walk On restaurants (ranging from 1 in 2009 to a peak of 102 in 2018).
finalviol_cat <- finalviol %>%
group_by(DESCRIPT, year, violcats) %>%
summarize(avg_cat = mean(violstatus_fail), count = n())
ggplot(filter(finalviol_cat, DESCRIPT == "Eating & Drinking"), aes(year, avg_cat, color = violcats)) +
geom_line() +
ggtitle("Frequency of Violation Categories for Eating & Drinking Restuarants") +
scale_x_continuous(name = "Year") +
scale_y_continuous(name = "Percent of Inspections Failed by Violation Category", limits = c(0, 0.6))
ggplot(filter(finalviol_cat, DESCRIPT == "Eating & Drinking w/ Take Out"), aes(year, avg_cat, color = violcats)) +
geom_line() +
ggtitle("Frequency of Violation Categories for Eating & Drinking w/ Take Out Restuarants") +
scale_x_continuous(name = "Year") +
scale_y_continuous(name = "Percent of Inspections Failed by Violation Category", limits = c(0, 0.6))
ggplot(filter(finalviol_cat, DESCRIPT == "Mobile Food Walk On"), aes(year, avg_cat, color = violcats)) +
geom_line() +
ggtitle("Frequency of Violation Categories for Mobile Food Walk On Restuarants") +
scale_x_continuous(name = "Year") +
scale_y_continuous(name = "Percent of Inspections Failed by Violation Category", limits = c(0, 1.0))
ggplot(filter(finalviol_cat, DESCRIPT == "Retail Food"), aes(year, avg_cat, color = violcats)) +
geom_line() +
ggtitle("Frequency of Violation Categories for Retail Food Restuarants") +
scale_x_continuous(name = "Year") +
scale_y_continuous(name = "Percent of Inspections Failed by Violation Category", limits = c(0, 0.7))
Across all restaurant types, the frequency of failing by violation category is approximately the same across violation categories across all years (generally within about a five percentage point band).
finalviol_violcats <- finalviol %>%
group_by(violcats, name, ZIP, LICENSENO, year) %>%
summarize(fails = sum(violstatus_fail), num_cat = n()) %>%
mutate(anyfails = ifelse(fails > 0, 1, 0), pcnt_fails = fails/num_cat)
finalviol_violcats %>% group_by(violcats) %>%
summarize(mean(anyfails), n())
## # A tibble: 13 x 3
## violcats `mean(anyfails)` `n()`
## <fct> <dbl> <int>
## 1 Dressing Rooms 0.951 324
## 2 Floors, Walls, and Ceilings 0.982 14463
## 3 Food 0.965 4377
## 4 Food Equipment and Utensils 0.983 18507
## 5 Food Protection 0.976 10897
## 6 Garbage and Refuse Disposal 0.961 2032
## 7 Insect, Rodent, Animal Control 0.968 4824
## 8 Lighting 0.973 1431
## 9 Others 0.974 7397
## 10 Personnel 0.974 6085
## 11 Plumbing 0.960 348
## 12 Toilet and hand-washing facilities 0.970 8961
## 13 Water 0.942 658
Across all violation categories, there are high rates of reatuarants failing at least one inspection
finalviol_catyear <- finalviol_violcats %>%
filter(anyfails > 0) %>%
group_by(violcats, year) %>%
summarize(avg_pcnt_fails = mean(pcnt_fails), count = n())
ggplot(finalviol_catyear, aes(year, avg_pcnt_fails, color = violcats)) +
geom_line() +
scale_x_continuous(name = "Year") +
scale_y_continuous(name = "Percent of Inspections Failed", limits = c(0, 0.7))
Among restaurants that failed in at least one category, there were approximately
finalviol_repeat <- finalviol %>%
group_by(name, DESCRIPT, ZIP, violcats) %>%
summarize(mult_fails = sum(violstatus_fail)) %>%
mutate(violrepeat = ifelse(mult_fails > 1, 1, 0))
Based on the data, the ten most common repeat offenders within the same violation category are: 1. Subway (02115) 2. Cosi (02110) 3. Dunkin Donuts (02125) 4. Dunkin Donuts (02128) 5. Dunkin Donuts (02122) 6. McDonalds (02135) 7. U Food Grill (02128) 8. Burger King (02128) 9. Dunkin Donuts RMG (02128) 10. Currito Burrito (02128)
Below attached an exploratory analysis on potentially different types of violations amongst Dunkin Donuts franchises. There does not seem to be a clear pattern here.
============================================================================================================ #Analysis based on geographic area Our goal in this case is to analyze the trend of health inspection violations based on the geographical area. From the heatmap below, we could see that most of the inspections happen in north part of Boston, around the airport area.
finalviol <- read.csv("https://raw.githubusercontent.com/plimasalle/Food_Project/master/finalviol.csv", header=T, na.strings=c(""," ","NA"))
For simplicity’s sake, I am looking at which zipcodes had the most inspections overall.
Please download the zipcode package in R.
library(zipcode)
data(zipcode)
zipcode <- zipcode %>% filter(state == "MA")
Please download the ggmap package in R. Note: The output will be a url. To access the map, copy and paste the url but replace the xxx at the end of the url with the access code listed below.
if(!requireNamespace("devtools")) install.packages("devtools")
## Loading required namespace: devtools
devtools::install_github("dkahle/ggmap", ref = "tidyup")
## Skipping install of 'ggmap' from a github remote, the SHA1 (4dfe5165) has not changed since last install.
## Use `force = TRUE` to force installation
library(ggmap)
## Google Maps API Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it: see citation("ggmap") for details.
register_google(key = "AIzaSyBJ7I6jURgmb0noA4zA4w6SZHsmjTneEoE", account_type = "premium", day_limit = 100000)
Bos_map <- ggmap(get_map(location = c(lon = -71.0589, lat = 42.3601), zoom=12, source = "google", col="bw"))
## Source : https://maps.googleapis.com/maps/api/staticmap?center=42.3601,-71.0589&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
Please download the rgdal package in R. install.packages(“rgdal”, source= “https://trac.osgeo.org/gdal/wiki/DownloadSource”)
library(rgdal)
## Loading required package: sp
## rgdal: version: 1.3-6, (SVN revision 773)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.1.3, released 2017/20/01
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rgdal/gdal
## GDAL binary built with GEOS: FALSE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rgdal/proj
## Linking to sp version: 1.3-1
boston_zips <- readOGR("~/Desktop/ZIP_Codes", "ZIP_Codes")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/csiego1/Desktop/ZIP_Codes", layer: "ZIP_Codes"
## with 43 features
## It has 4 fields
## Integer64 fields read as strings: OBJECTID
shape_to_ggplot <- function(shape){
require(broom)
gg_data <- tidy(shape)
data <- slot(shape, "data")
shape[["polyID"]] <- sapply(slot(shape, "polygons"), function(x) slot(x, "ID"))
gg_data <- merge(gg_data, shape, by.x="id", by.y="polyID")
return(gg_data)
}
boston_zips_ggplot <- shape_to_ggplot(boston_zips)
## Loading required package: broom
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Regions defined for each Polygons
colnames(boston_zips_ggplot)[2] <- "lon"
ggplot() + geom_polygon(data=boston_zips_ggplot, aes(x=lon, y=lat, group=group), fill="white", color="black")
(a) Join map and shapefile dataframes by zipcode (b) Overlay the two maps together (c) Create heat map visualization (d) Add labels (NOTE: I wasn’t able to align the two maps perfectly, but I think the final output still works for this project)
geotrends$ZIP_2 <- paste0("0", as.character(geotrends$ZIP))
boston_zips_ggplot_2 <- left_join(boston_zips_ggplot, geotrends, by=c("ZIP5"="ZIP_2"))
## Warning: Column `ZIP5`/`ZIP_2` joining factor and character vector,
## coercing into character vector
polygon_map <- ggplot() + geom_polygon(data=boston_zips_ggplot_2, aes(x=lat, y=lon, group=group, fill=number_inspection), color="black", alpha=0.7) + geom_point(data=geotrends, aes(x=Lat, y=Long))
FINAL <- Bos_map + geom_polygon(data=boston_zips_ggplot_2, aes(y=lat, x=lon, group=group, fill=number_inspection), color="black", alpha=0.7) + scale_fill_gradient2("Number of Times Inspected (2006 - 2018)", low = "blue", mid = "white", high = "red", midpoint = 0) + ggtitle("Total Food Inspections in Boston Eateries (by zip code)") + xlim(-71.15, -70.95) + ylim(42.28, 42.44)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
FINAL
## Warning: Removed 1 rows containing missing values (geom_rect).